Method, system, and computer program product for adaptive congestion control on virtual lanes for data center ethernet architecture

ABSTRACT

Congestion is adaptively controlled in a data center Ethernet (DCE) network. Packets are received over at least one virtual lane in the DCE network. An absolute or relative packet arrival rate is computed over a time period. The absolute or relative packet arrival rate is compared to at least a first threshold and a second threshold. If the absolute or relative packet arrival rate increases beyond the first threshold, the packet transmission rate is caused to decrease. If the absolute or relative packet arrival rate is less than a second threshold, the packet transmission rate is caused to increase.

TRADEMARKS

IBM® is a registered trademark of International Business Machines Corporation, Armonk, N.Y., U.S.A. Other names used herein may be registered trademarks, trademarks or product names of International Business Machines Corporation or other companies.

BACKGROUND

The present invention relates generally to traffic control, and, in particular, to adaptive congestion control.

Data Center Ethernet (DCE) is an emerging industry standard which proposes modifications to existing networks, in an effort to position Ethernet as the preferred convergence fabric for all types of data center traffic. A recent study has found that Ethernet is the convergence fabric, with I/O consolidation in a Data Center as shown in FIG. 1. This consolidation is expected to simplify platform architecture and reduce overall platform costs. More details of proposals for consolidation are described in “Proposal for Traffic Differentiation in Ethernet Networks,” which may be found at http://www.ieee802.org/1/files/public/docs2005/new-wadekar-virtual%20-links-0305.pdf.

Major changes have been proposed for DCE (also referred to as enhanced Ethernet and low latency Ethernet), including the addition of credit based flow control at the link layer, congestion detection and data rate throttling, and the addition of virtual lanes with quality of service differentiation. It is important to note that these functions do not affect Transmission Control Protocol/Internet Protocol (TCP/IP), which exists above the DCE level. It should also be noted that DCE is intended to operate without necessitating the overhead of TCP/IP. This offers a much simpler, low cost approach that does not require offload processing or accelerators.

Implementation of DCE will require a new DCE compatible network interface card at the server, storage control unit, and Ethernet switch, most likely capable of 10 Gigabit data rates. There are server related architectural efforts, including low latency Ethernet for high performance servers and encapsulation of various other protocols in a DCE fabric to facilitate migration to a converged DCE network over the next several years. This new architecture for data center networks presents many technical challenges.

Conventional Ethernet networks running under TCP/IP are allowed to drop data packets under certain conditions. These networks are known as “best effort” or lossy networks. Networks using other protocols, such as Asynchronous Transfer Mode (ATM), also use this approach. Such networks rely on dropped packets for detecting congestion. In a network using TCP/IP, the TCP/IP software provides a form of end-to-end flow control for such networks. However, recovery from packet dropping can incur a significant latency penalty. Furthermore, any network resources already used by packets that have been dropped are also wasted. It has been well established that enterprise data center environments require a lossless protocol that does not drop packets unless the packets are corrupted. Also, an enterprise data center environment requires much faster recovery mechanisms, such as Fibre Channel Protocol, InfiniBand, etc. Lossless networks prevent buffer overflows, offer faster response time to recover corrupted packets, do not suffer from loss-induced throughput limitations and allow burst traffic flow to enter the network without delay, at full bandwidth. It is important to note that these functions do not affect TCP/IP, which is above the DCE level. Some other form of flow control and congestion resolution is needed to address these concerns.

Networks using credit based flow control are subject to congestion “hot spots”. This problem is illustrated in FIGS. 2A-2D. The example illustrated in these figures shows a switch fabric with three layers of cascaded switching (switch layer 1, switch layer 2, and switch layer 3) and their associated traffic flows. While three switch layers are shown for simplicity of illustration, it should be appreciated that a switch fabric may contain many more switch layers.

In FIG. 2A, traffic flows smoothly without congestion. However, as shown in FIG. 2B, if a sufficient fraction of all the input traffic targets the same output port, that output link may saturate, forming a “hot spot” 210. This causes the queues on the switches feeding the link to fill up. If the traffic pattern persists, available buffer space on the switches may be exhausted. This, in turn, may cause the previous stage of switching to saturate its buffer space, forming additional hot spots 220 and 230 as shown in FIG. 2C. The congestion eventually may back up all the way to the network input nodes, forming hot spots 240-256. This is referred to as congestion spread or tree saturation. One or more saturation trees may develop at the same time and spread through the network very quickly. In a fully formed saturation tree, every packet must cross at least one saturated switch on its way through the network. The network, as a whole, can suffer a catastrophic loss of throughput as a result.

There have been several proposed solutions to this problem. One proposed solution involves detecting a potential buffer overflow condition at the switch and broadcasting a message downstream to the destination, then back to the source, requesting that the data rate be throttled back. This approach takes time. Also, it relies on a preset threshold in the switch for detecting when a buffer is nearing saturation. Bursts of traffic may cause the switch to exceed its threshold level quickly and to die down again just as quickly. A single threshold based on traffic volume is unable to compensate fast enough under these conditions.

Many other conventional schemes require some a priori knowledge of where the congestion point is located. These schemes only work well for traffic patterns that are predictable and are not suited for mixed traffic having unpredictable traffic patterns.

Another common workaround involves allocating excess bandwidth or over-provisioning the network to avoid hotspot formation. However, over-provisioning does not scale well as the number of network nodes increases and is an expensive solution as data rates approach 10 Gbit/s. Furthermore, DCE is intended to mix different data traffic patterns (voice, storage, streaming video, and other enterprise data) onto a single network. This makes it much more likely that DCE will encounter hotspot congestion, since the traffic pattern is less predictable.

SUMMARY

According to an exemplary embodiment, a method, system, and computer program product are provided for adaptive congestion control in a Data Center Ethernet (DCE) network. Packets are received over at least one virtual lane in the DCE network. An absolute or relative packet arrival rate is computed over a time period. The absolute or relative packet arrival rate is compared to at least a first threshold and a second threshold. If the absolute or relative packet arrival rate exceeds the first threshold, the packet transmission rate is caused to decrease. If the absolute or relative packet arrival rate is less than a second threshold, the packet transmission rate is caused to increase.

BRIEF DESCRIPTION OF THE DRAWINGS

Referring to the exemplary drawings, wherein like elements are numbered alike in the several Figures:

FIG. 1 illustrates proposed consolidation of traffic in a Data Center Ethernet (DCE) network.

FIGS. 2A-2D illustrate congestion “hot spots” that occur in conventional credit-based flow control networks.

FIG. 3 illustrates a method for adaptive congestion control according to an exemplary embodiment.

FIG. 4A illustrates a change in offsets between transmission of packets and receipt of packets.

FIG. 4B illustrates a method for adaptive congestion control according to another embodiment.

FIG. 5 illustrates an exemplary system for adaptive congestion control according to an exemplary embodiment.

FIG. 6 illustrates an exemplary system for implementing adaptive congestion control using a computer program product according to exemplary embodiments.

DETAILED DESCRIPTION

According to an exemplary embodiment, reliability at the link layer in a large Data Center Ethernet (DCE) network is enhanced. In one embodiment, the packet arrival rate is dynamically computed and compared to thresholds, instead of simply counting the total number of accumulated packets in a switch buffer. This allows a potential congestion condition to be more quickly detected and to be responded to as appropriate. In cases where the congestion builds up slowly, this approach may also wait until it becomes necessary to throttle back the packet transmission rate. This approach also allows recovery to be performed more quickly after the congestion has passed, and the links are throttled back up to their full operating rates.

According to an exemplary embodiment, in order to prevent dropped packets, each packet is assigned a packet sequence number (PSN). In one embodiment, the PSN may include 24 bits reserved in the packet header with an optional 3 bit session identifier. Schemes to assign these numbers and to re-initialize the sequence of valid PSNs when a link is re-established are described, e.g., in commonly assigned U.S. patent application Ser. No. 11/426,421, herein incorporated by reference.
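By way of illustration only, a 24 bit PSN with a 3 bit session identifier might be packed into a single header field as sketched below. The field layout (session identifier in the bits immediately above the PSN), the wrap-around behavior of `next_psn`, and all function names are assumptions made for this sketch; the actual assignment and re-initialization schemes are those of the referenced application and are not reproduced here.

```python
# Hypothetical layout: [ 3-bit session id | 24-bit PSN ] (an assumption,
# not a layout taken from the specification).
PSN_BITS = 24
SESSION_BITS = 3
PSN_MASK = (1 << PSN_BITS) - 1          # 0xFFFFFF
SESSION_MASK = (1 << SESSION_BITS) - 1  # 0b111


def pack_psn(session_id: int, psn: int) -> int:
    """Combine a session identifier and a PSN into one integer field."""
    if not 0 <= session_id <= SESSION_MASK:
        raise ValueError("session_id must fit in 3 bits")
    if not 0 <= psn <= PSN_MASK:
        raise ValueError("psn must fit in 24 bits")
    return (session_id << PSN_BITS) | psn


def unpack_psn(field: int) -> tuple[int, int]:
    """Split a packed field back into (session_id, psn)."""
    return (field >> PSN_BITS) & SESSION_MASK, field & PSN_MASK


def next_psn(psn: int) -> int:
    """Advance the PSN, wrapping at 2**24 (wrap-around is assumed)."""
    return (psn + 1) & PSN_MASK
```

For example, `pack_psn(5, 0xABCDEF)` yields `0x5ABCDEF`, and `unpack_psn` recovers the original pair.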

FIG. 3 illustrates a method for adaptive congestion control in a DCE network according to an exemplary embodiment. A packet is received at a switch over a virtual lane at step 310. A determination is made whether the packet has a valid PSN at step 320. This determination may be made, e.g., in a switch (such as the switch 510 a shown in FIG. 5). If the packet does not bear a valid PSN, an error is generated at step 325, and the process returns to step 310. If the packet does bear a valid PSN, a determination is made whether a counter timer is running at step 330. If not, a counter timer is started and incremented by one at step 340. If a counter timer is running, it is incremented by one at step 345. The counter timer is incremented when each successive packet bearing a valid PSN arrives. At step 350, the absolute packet arrival rate is computed. There is no need to check for sequential PSNs at this point, since only the packet arrival rate is being measured. The absolute packet arrival rate may be computed over a fixed interval of time or a variable length time window. The absolute packet arrival rate is compared to various thresholds at steps 360-368. This comparison may be performed in the switch. If the absolute packet arrival rate is determined at step 362 to exceed a threshold level indicating that the packet arrival rate is increasing quickly, a message is sent, e.g., from the switch, to a source node (e.g., the source node 520 shown in FIG. 5) to throttle down the packet transmission rate at step 372. If the absolute packet arrival rate is determined at step 364 to exceed a lower threshold, indicating that the packet arrival rate is increasing slowly, the input may be throttled down after waiting a predetermined amount of time at step 374. The process may also be reversed, so that if the absolute packet arrival rate is determined to be less than a predetermined threshold at step 366, indicating that the packet arrival rate is decreasing slowly, a command can be sent to the source node to increase the packet transmission rate after a predetermined amount of time at step 376. If the absolute packet arrival rate is determined to be less than a lower threshold at step 368, indicating that the packet arrival rate is decreasing quickly, the source node may be caused to increase the packet transmission rate quickly at step 378.
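By way of illustration only, the counting and threshold comparison of FIG. 3 might be sketched as follows. The names `Thresholds`, `Action`, `arrival_rate`, and `classify`, and the strict ordering assumed among the four thresholds, are assumptions made for this sketch; the specification does not prescribe particular threshold values, units, or data structures.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    THROTTLE_DOWN_NOW = "decrease immediately"    # cf. step 372
    THROTTLE_DOWN_AFTER_WAIT = "decrease later"   # cf. step 374
    HOLD = "no change"
    THROTTLE_UP_AFTER_WAIT = "increase later"     # cf. step 376
    THROTTLE_UP_NOW = "increase immediately"      # cf. step 378


@dataclass(frozen=True)
class Thresholds:
    """Four rate thresholds in packets per second (hypothetical units)."""
    fast_down: float  # cf. step 362
    slow_down: float  # cf. step 364
    slow_up: float    # cf. step 366
    fast_up: float    # cf. step 368

    def __post_init__(self) -> None:
        # Assumed ordering: fast_down > slow_down > slow_up > fast_up.
        if not (self.fast_down > self.slow_down > self.slow_up > self.fast_up):
            raise ValueError("thresholds must be strictly decreasing")


def arrival_rate(valid_packet_count: int, window_seconds: float) -> float:
    """Absolute arrival rate: valid-PSN packets counted over the window."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    return valid_packet_count / window_seconds


def classify(rate: float, t: Thresholds) -> Action:
    """Map a computed rate onto one of the responses of FIG. 3."""
    if rate > t.fast_down:
        return Action.THROTTLE_DOWN_NOW
    if rate > t.slow_down:
        return Action.THROTTLE_DOWN_AFTER_WAIT
    if rate < t.fast_up:
        return Action.THROTTLE_UP_NOW
    if rate < t.slow_up:
        return Action.THROTTLE_UP_AFTER_WAIT
    return Action.HOLD
```

With thresholds of 1000, 800, 400, and 200 packets per second, a window in which 450 valid packets arrive in 0.5 seconds gives a rate of 900 and is classified as a delayed throttle-down.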

According to another embodiment, time stamping of DCE packet headers may be used instead of a counter to determine a relative packet arrival rate for use in comparison with thresholds. FIG. 4A illustrates how there may be changes (delta) in the offset from the time a packet is transmitted until it is received over time for various packets. The offset between packet transmission times and packet receipt times may be measured over a period of time and used as an indication of a relative packet transmission rate. The offset may be measured by detecting a time stamp put on a packet header at the time of transmission (e.g., from a switch) indicating a time of transmission of the packet and determining a time at which the packet is received (e.g., at another switch). The offset is the difference in time between the time of transmission and the time of receipt. To account for latency, the time stamp may be put on the header of the packet as it exits a node, e.g., a switch. Changes in the offset may indicate whether the packet arrival rate is increasing or decreasing over a period of time. The relative packet arrival rate may be computed based on the offsets between packet transmission times and packet arrival times, and measures may be taken to cause the packet arrival rate to increase or decrease by comparing the computed relative packet arrival rate to various thresholds as explained below with reference to FIG. 4B. This allows a centralized manager within the network to determine congestion points or whether the source packet injection rate is just slow. This embodiment does not require synchronization of receive and transmit clocks anywhere in the network (including the source, destination and internal nodes and switches), as only offsets between transmission and arrival times are used to compute a relative packet arrival rate. Thus, it is possible that the offset in time between transmission of a packet and receipt of a packet may be a negative value.
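By way of illustration only, the following sketch shows why unsynchronized clocks do not matter when only changes in the offset are used. The specification does not give a formula for the relative packet arrival rate; the mean change in offset between consecutive packets (`mean_offset_drift`) is an assumed stand-in chosen for this sketch, and the function names and integer time units are likewise hypothetical.

```python
def offsets(tx_times: list[int], rx_times: list[int]) -> list[int]:
    """Per-packet offset: receipt time minus stamped transmission time."""
    if len(tx_times) != len(rx_times):
        raise ValueError("need one receipt time per transmission time")
    return [rx - tx for tx, rx in zip(tx_times, rx_times)]


def mean_offset_drift(tx_times: list[int], rx_times: list[int]) -> float:
    """Average change in offset between consecutive packets.

    A positive value means offsets are growing over the window; a
    negative value means they are shrinking. This metric is an
    assumption of the sketch, not a formula from the specification.
    """
    offs = offsets(tx_times, rx_times)
    if len(offs) < 2:
        raise ValueError("need at least two packets")
    deltas = [later - earlier for earlier, later in zip(offs, offs[1:])]
    return sum(deltas) / len(deltas)


# Transmit stamps and receipt times in arbitrary integer ticks.
tx = [0, 100, 200, 300]
rx = [50, 160, 270, 380]            # offsets: 50, 60, 70, 80

# The same receipts seen by a receiver whose clock runs 1000 ticks behind:
# every offset becomes negative, but the drift is unchanged.
rx_skewed = [t - 1000 for t in rx]  # offsets: -950, -940, -930, -920
```

Because a constant clock difference is added to every offset equally, it cancels when consecutive offsets are subtracted, so `mean_offset_drift(tx, rx)` and `mean_offset_drift(tx, rx_skewed)` agree even though the skewed offsets are negative.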

FIG. 4B illustrates a method for adaptive congestion control in a DCE network using a time stamp approach according to an exemplary embodiment. A packet is received at a switch over a virtual lane at step 410. An offset between the time the packet was transmitted and the time the packet is received is determined at step 420 as described above. At step 440, the relative packet arrival rate is computed based on the offsets between transmission times and arrival times for a number of packets over a period of time. The relative packet arrival rate may be computed over a fixed interval of time or a variable length time window. Similar to the process shown in FIG. 3, the relative packet arrival rate is compared to various thresholds at steps 460-468. This comparison may be performed in the switch. If the relative packet arrival rate is determined at step 462 to exceed a threshold level indicating that the packet arrival rate is increasing quickly, a message is sent to a source node (e.g., the source node 520 shown in FIG. 5) to throttle down the packet transmission rate at step 472. If the relative packet arrival rate is determined at step 464 to exceed a lower threshold, indicating that the relative packet arrival rate is increasing slowly, the input may be throttled down after waiting a predetermined amount of time at step 474. The process may also be reversed, so that if the relative packet arrival rate is determined to be less than a predetermined threshold at step 466, indicating that the packet arrival rate is decreasing slowly, a command can be sent to the source node to increase the packet transmission rate after a predetermined amount of time at step 476. If the relative packet arrival rate is determined to be less than a lower threshold at step 468, indicating that the relative packet arrival rate is decreasing quickly, the source node may be caused to increase the packet transmission rate quickly at step 478.

According to an exemplary embodiment, the processes depicted in FIGS. 3 and 4B may be implemented in a switch by control logic and/or a computer processor performing instructions encoded on a computer readable medium, such as CD-ROM disks or floppy disks, included in a computer program product.

The approaches described above may be implemented on every node in a network at each switch. In this way, bursts of traffic that are a potential cause of congestion may be responded to more quickly. Also, recovery may be performed more quickly when the congestion passes.

FIG. 5 illustrates an exemplary system for adaptive congestion control according to an exemplary embodiment. As shown in FIG. 5, packets are transmitted between a source node 520 and a destination node 530 via switches 510 a and 510 b and a DCE fabric of links 540. Although two switches are shown in FIG. 5 for simplicity of illustration, it should be appreciated that there may be many more switches. Packet arrival rates may be measured/computed and compared with thresholds in the switches 510 a and 510 b, as described above. The switches, in turn, may cause the source node (or the destination node, if traffic is being sent from the destination node) to increase/decrease the packet transmission rate as needed.

As described above, different threshold levels may be set for different rates of traffic increase/decrease. If the traffic transmission rate is increased slowly, for example, there may be a pause before requesting that the source throttle down the input data rate. Similarly, if the traffic transmission rate is decreased slowly, there may be a pause before requesting the source to throttle up the input data rate. In this way, the maximum amount of data is kept in the pipeline for as long as possible, making more efficient use of the available network bandwidth. The maximum allowed receive buffer allocation may be adjusted, depending on the packet arrival rate and thresholds. Also, according to exemplary embodiments, faster recovery from congestion conditions may be achieved in comparison to simply measuring the total number of packets. This enables proactive prevention of formation of congestion trees and optimizes network throughput and efficiency.

Further, this approach may be implemented on a per-lane basis in a system that transmits several virtual traffic flows across a common high speed connection. In this way, a burst of traffic on one virtual lane will not cause congestion for other traffic streams that share the same physical connection. This load balancing is particularly beneficial for mixed traffic types. Also, the allocation of the receive buffer size may be adjusted based on increases and decreases in packet arrival rates. It is even possible to implement a feedback loop for dynamically allocating traffic among different virtual lanes depending on the level of congestion.
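By way of illustration only, per-lane operation might be sketched by keeping a separate counter for each virtual lane and evaluating each lane against the thresholds independently. The class name `PerLaneRateMonitor`, the use of only two thresholds, and the string-valued decisions are assumptions made to keep the sketch short.

```python
from collections import defaultdict


class PerLaneRateMonitor:
    """Counts packets per virtual lane and decides per lane (a sketch)."""

    def __init__(self, window_seconds: float, high: float, low: float) -> None:
        if window_seconds <= 0 or high <= low:
            raise ValueError("need a positive window and high > low")
        self.window_seconds = window_seconds
        self.high = high  # rate above which a lane is asked to slow down
        self.low = low    # rate below which a lane may speed up
        self._counts: dict[int, int] = defaultdict(int)

    def on_packet(self, lane: int) -> None:
        """Record one received packet on the given virtual lane."""
        self._counts[lane] += 1

    def end_window(self) -> dict[int, str]:
        """Compute each lane's rate, decide, and reset the counters."""
        decisions = {}
        for lane, count in self._counts.items():
            rate = count / self.window_seconds
            if rate > self.high:
                decisions[lane] = "decrease"
            elif rate < self.low:
                decisions[lane] = "increase"
            else:
                decisions[lane] = "hold"
        self._counts.clear()
        return decisions
```

In this sketch a burst of 1000 packets on lane 0 within a one second window leads to a "decrease" decision for lane 0 only; a lane carrying 10 packets in the same window is left unchanged when the thresholds are 500 and 5.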

As noted above, the processes depicted in FIGS. 3 and 4B may be implemented in a switch by control logic and/or a computer processor performing instructions encoded on a computer readable medium, such as CD-ROM disks or floppy disks, included in a computer program product. An exemplary system for implementing the processes on a computer program product is shown in FIG. 6.

FIG. 6 illustrates an exemplary system for implementing adaptive congestion control using a computer program product according to exemplary embodiments. The system includes a computer 600 in contact with a signal bearing medium 640 via an input/output interface 630. The signal bearing medium 640 may include instructions for performing the adaptive congestion control techniques described above implemented as, e.g., information permanently stored on non-writable storage media (e.g., read-only memory devices within a computer, such as CD-ROM disks readable by a CD-ROM drive), alterable information stored on writeable storage media (e.g., floppy disks within a diskette drive or hard-disk drive), information conveyed to a computer by a communications medium, such as through a computer or telephone network, including wireless and broadband communications networks, such as the Internet, etc.

The computer includes a processor 610 that executes the instructions for performing the adaptive congestion control technique contained, e.g., on the signal bearing medium 640 and communicated to the computer via the input/output interface 630. The instructions for performing adaptive congestion control may be stored in a memory 620 or may be retained on the signal bearing medium 640.

While the invention has been described with reference to exemplary embodiments, it will be understood by those skilled in the art that various changes may be made and equivalents may be substituted for elements thereof without departing from the scope of the invention. In addition, many modifications may be made to adapt a particular situation or material to the teachings of the invention without departing from the essential scope thereof. Therefore, it is intended that the invention not be limited to the particular embodiment disclosed as the best mode contemplated for carrying out this invention, but that the invention will include all embodiments falling within the scope of the appended claims.

1. A method for adaptive congestion control in a Data Center Ethernet network, comprising: receiving packets over at least one virtual lane in the DCE network; computing an absolute or a relative packet arrival rate over a time period; comparing the absolute or relative packet arrival rate to at least a first threshold and a second threshold; if the absolute or relative packet arrival rate exceeds the first threshold, causing the packet transmission rate to decrease; and if the absolute or relative packet arrival rate is less than the second threshold, causing the packet transmission rate to increase.
2. The method of claim 1, wherein the absolute arrival rate is computed by: determining whether each received packet has a valid packet sequence number; and if the received packet has a valid packet sequence number, incrementing a counter, wherein the absolute packet arrival rate is computed based on increments in the counter over the time period.
3. The method of claim 1, wherein the relative packet arrival rate is computed by: detecting a timestamp of each received packet indicating a time when the packet was transmitted; and determining an offset from a time of receipt of the packet and the time when the packet was transmitted, wherein the relative packet arrival rate is computed based on offsets in receipt of packets over the time period.
4. The method of claim 1, wherein the absolute or relative packet arrival rate is computed over a fixed time period or a variable time period.
5. The method of claim 1, further comprising: comparing the absolute or relative packet arrival rate to a third threshold; and if the absolute or relative packet arrival rate exceeds the third threshold but is less than the first threshold, causing the packet transmission rate to decrease after a pause.
6. The method of claim 1, further comprising: comparing the absolute or relative packet arrival rate to a fourth threshold; and if the absolute or relative packet arrival rate is less than the fourth threshold but is not less than the second threshold, causing the packet transmission rate to increase after a pause.
7. The method of claim 1, wherein the steps are performed for packets on a per-virtual-lane basis.
8. The method of claim 1, further comprising dynamically increasing or decreasing packet transmission rates on other virtual lanes depending on the computed packet arrival rate on the at least one virtual lane.
9. A system for adaptive congestion control in a Data Center Ethernet network, comprising: a transmitter for transmitting packets over at least one virtual lane in the DCE network; a receiver for receiving the transmitted packets from the transmitter; and a switch interspersed between the transmitter and the receiver, wherein the switch receives the packets from the transmitter and computes an absolute or a relative packet arrival rate over a time period and compares the absolute or relative packet arrival rate to at least a first threshold and a second threshold, wherein if the absolute or relative packet arrival rate exceeds the first threshold, the switch causes the packet transmission rate to decrease, and if the absolute or relative packet arrival rate is less than the second threshold, the switch causes the packet transmission rate to increase.
10. The system of claim 9, wherein the switch further determines whether each received packet has a valid packet sequence number, and if the received packet has a valid packet sequence number, the switch increments a counter, wherein the switch computes the absolute packet arrival rate based on increments in the counter over the time period.
11. The system of claim 9, wherein the switch detects a timestamp of each received packet indicating a time when the packet was transmitted and determines an offset from a time of receipt of the packet and the time when the packet was transmitted, wherein the step of computing the relative packet arrival rate is based on offsets in receipt of packets over the time period.
12. The system of claim 9, wherein the absolute or relative packet arrival rate is computed over a fixed time period or a variable time period.
13. The system of claim 9, wherein the switch further compares the absolute or relative packet arrival rate to a third threshold, and if the absolute or relative packet arrival rate exceeds the third threshold but is less than the first threshold, the switch causes the packet transmission rate to decrease after a pause.
14. The system of claim 9, wherein the switch further compares the absolute or relative packet arrival rate to a fourth threshold, and if the absolute or relative packet arrival rate is less than the fourth threshold but is not less than the second threshold, the switch causes the packet transmission rate to increase after a pause.
15. A computer program product for adaptive congestion control in a Data Center Ethernet network, comprising a computer usable medium having a computer readable program, wherein the computer readable program, when executed on a computer, causes the computer to: compute an absolute or a relative packet arrival rate over a time period; compare the absolute or relative packet arrival rate to at least a first threshold and a second threshold; if the absolute or relative packet arrival rate exceeds the first threshold, cause the packet transmission rate to decrease; and if the absolute or relative packet arrival rate is less than the second threshold, cause the packet transmission rate to increase.
16. The computer program product of claim 15, wherein the absolute arrival rate is computed by: determining whether each received packet has a valid packet sequence number; and if the received packet has a valid packet sequence number, incrementing a counter, wherein the absolute packet arrival rate is computed based on increments in the counter over the time period.
17. The computer program product of claim 15, wherein the relative packet arrival rate is computed by: detecting a timestamp of each received packet indicating a time when the packet was transmitted; and determining an offset from a time of receipt of the packet and the time when the packet was transmitted, wherein the relative packet arrival rate is computed based on offsets in receipt of packets over the time period.
18. The computer program product of claim 15, wherein the absolute or relative packet arrival rate is computed over a fixed time period or a variable time period.
19. The computer program product of claim 15, wherein the computer readable medium further includes instructions that, when executed on a computer, cause the computer to: compare the absolute or relative packet arrival rate to a third threshold; and if the absolute or relative packet arrival rate exceeds the third threshold but is less than the first threshold, cause the packet transmission rate to decrease after a pause.
20. The computer program product of claim 15, wherein the computer readable medium further includes instructions that, when executed on a computer, cause the computer to: compare the absolute or relative packet arrival rate to a fourth threshold; and if the absolute or relative packet arrival rate is less than the fourth threshold but is not less than the second threshold, cause the packet transmission rate to increase after a pause.